First, removing any missing data is not part of a data cleaning process. Indeed, no one likes missing data, but it is dangerous to assume that it can simply be removed or replaced. Sometimes missing data tells us something important about whatever it is that we’re measuring (i.e. the value of the variable that is missing may be related to - the reason it is missing). Such data is called Missing not at Random, or MNAR.
Here’s what messy data look like
In the final chapter of this course, you will be presented with a messy, real-world dataset containing an entire year’s worth of weather data from Boston, USA. Among other things, you’ll be presented with variables that contain column names, column names that should be values, numbers coded as character strings, and values that are missing, extreme, and downright erroneous!
# Read weather RDS data
weather <- readRDS("../xDatasets/weather.rds")
# View the first 6 rows of data
weather %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| X | year | month | measure | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | X24 | X25 | X26 | X27 | X28 | X29 | X30 | X31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2014 | 12 | Max.TemperatureF | 64 | 42 | 51 | 43 | 42 | 45 | 38 | 29 | 49 | 48 | 39 | 39 | 42 | 45 | 42 | 44 | 49 | 44 | 37 | 36 | 36 | 44 | 47 | 46 | 59 | 50 | 52 | 52 | 41 | 30 | 30 |
| 2 | 2014 | 12 | Mean.TemperatureF | 52 | 38 | 44 | 37 | 34 | 42 | 30 | 24 | 39 | 43 | 36 | 35 | 37 | 39 | 37 | 40 | 45 | 40 | 33 | 32 | 33 | 39 | 45 | 44 | 52 | 44 | 45 | 46 | 36 | 26 | 25 |
| 3 | 2014 | 12 | Min.TemperatureF | 39 | 33 | 37 | 30 | 26 | 38 | 21 | 18 | 29 | 38 | 32 | 31 | 32 | 33 | 32 | 35 | 41 | 36 | 29 | 27 | 30 | 33 | 42 | 41 | 44 | 37 | 38 | 40 | 30 | 22 | 20 |
| 4 | 2014 | 12 | Max.Dew.PointF | 46 | 40 | 49 | 24 | 37 | 45 | 36 | 28 | 49 | 45 | 37 | 28 | 28 | 29 | 33 | 42 | 46 | 34 | 25 | 30 | 30 | 39 | 45 | 46 | 58 | 31 | 34 | 42 | 26 | 10 | 8 |
| 5 | 2014 | 12 | MeanDew.PointF | 40 | 27 | 42 | 21 | 25 | 40 | 20 | 16 | 41 | 39 | 31 | 27 | 26 | 27 | 29 | 36 | 41 | 30 | 22 | 24 | 27 | 34 | 42 | 44 | 43 | 29 | 31 | 35 | 20 | 4 | 5 |
| 6 | 2014 | 12 | Min.DewpointF | 26 | 17 | 24 | 13 | 12 | 36 | -3 | 3 | 28 | 37 | 27 | 25 | 24 | 25 | 27 | 30 | 32 | 26 | 20 | 20 | 25 | 25 | 37 | 41 | 29 | 28 | 29 | 27 | 10 | -6 | 1 |
# View the last 6 rows of data
weather %>%
tail() %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| X | year | month | measure | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | X24 | X25 | X26 | X27 | X28 | X29 | X30 | X31 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 281 | 281 | 2015 | 12 | Mean.Wind.SpeedMPH | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 282 | 282 | 2015 | 12 | Max.Gust.SpeedMPH | 17 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 283 | 283 | 2015 | 12 | PrecipitationIn | 0.14 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 284 | 284 | 2015 | 12 | CloudCover | 7 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 285 | 285 | 2015 | 12 | Events | Rain | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 286 | 286 | 2015 | 12 | WindDirDegrees | 109 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
# View a condensed summary of the data
str(weather)## 'data.frame': 286 obs. of 35 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ month : int 12 12 12 12 12 12 12 12 12 12 ...
## $ measure: chr "Max.TemperatureF" "Mean.TemperatureF" "Min.TemperatureF" "Max.Dew.PointF" ...
## $ X1 : chr "64" "52" "39" "46" ...
## $ X2 : chr "42" "38" "33" "40" ...
## $ X3 : chr "51" "44" "37" "49" ...
## $ X4 : chr "43" "37" "30" "24" ...
## $ X5 : chr "42" "34" "26" "37" ...
## $ X6 : chr "45" "42" "38" "45" ...
## $ X7 : chr "38" "30" "21" "36" ...
## $ X8 : chr "29" "24" "18" "28" ...
## $ X9 : chr "49" "39" "29" "49" ...
## $ X10 : chr "48" "43" "38" "45" ...
## $ X11 : chr "39" "36" "32" "37" ...
## $ X12 : chr "39" "35" "31" "28" ...
## $ X13 : chr "42" "37" "32" "28" ...
## $ X14 : chr "45" "39" "33" "29" ...
## $ X15 : chr "42" "37" "32" "33" ...
## $ X16 : chr "44" "40" "35" "42" ...
## $ X17 : chr "49" "45" "41" "46" ...
## $ X18 : chr "44" "40" "36" "34" ...
## $ X19 : chr "37" "33" "29" "25" ...
## $ X20 : chr "36" "32" "27" "30" ...
## $ X21 : chr "36" "33" "30" "30" ...
## $ X22 : chr "44" "39" "33" "39" ...
## $ X23 : chr "47" "45" "42" "45" ...
## $ X24 : chr "46" "44" "41" "46" ...
## $ X25 : chr "59" "52" "44" "58" ...
## $ X26 : chr "50" "44" "37" "31" ...
## $ X27 : chr "52" "45" "38" "34" ...
## $ X28 : chr "52" "46" "40" "42" ...
## $ X29 : chr "41" "36" "30" "26" ...
## $ X30 : chr "30" "26" "22" "10" ...
## $ X31 : chr "30" "25" "20" "8" ...
Exploring raw data
Getting a feel for your data
The first thing to do when you get your hands on a new dataset is to understand its structure. There are several ways to go about this in R, each of which may reveal different issues with your data that require attention.
In this course, we are only concerned with data that can be expressed in table format (i.e. two dimensions, rows and columns). As you may recall from earlier courses, tables in R often have the type data.frame. You can check the class of any object in R with the class() function.
Once you know that you are dealing with tabular data, you may also want to get a quick feel for the contents of your data. Before printing the entire dataset to the console, it’s probably worth knowing how many rows and columns there are. The dim() command tells you this.
# Read BMI data
library(readr)
bmi <- read_csv("../xDatasets/bmi_clean.csv")
# Check the class of bmi
class(bmi)## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
# Check the dimensions of bmi
dim(bmi)## [1] 199 30
# View the column names of bmi
names(bmi)## [1] "Country" "Y1980" "Y1981" "Y1982" "Y1983" "Y1984" "Y1985"
## [8] "Y1986" "Y1987" "Y1988" "Y1989" "Y1990" "Y1991" "Y1992"
## [15] "Y1993" "Y1994" "Y1995" "Y1996" "Y1997" "Y1998" "Y1999"
## [22] "Y2000" "Y2001" "Y2002" "Y2003" "Y2004" "Y2005" "Y2006"
## [29] "Y2007" "Y2008"
Viewing the structure of your data
Since bmi doesn’t have a huge number of columns, you can view a quick snapshot of your data using the str() (for structure) command. In addition to the class and dimensions of your entire dataset, str() will tell you the class of each variable and give you a preview of its contents.
Although we won’t go into detail on the dplyr package in this lesson (see the Data Manipulation in R with dplyr course), the glimpse() function from dplyr is a slightly cleaner alternative to str(). str() and glimpse() give you a preview of your data, which may reveal issues with the way columns are labelled, how variables are encoded, etc.
# Check the structure of bmi
str(bmi)## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 199 obs. of 30 variables:
## $ Country: chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ Y1980 : num 21.5 25.2 22.3 25.7 20.9 ...
## $ Y1981 : num 21.5 25.2 22.3 25.7 20.9 ...
## $ Y1982 : num 21.5 25.3 22.4 25.7 20.9 ...
## $ Y1983 : num 21.4 25.3 22.5 25.8 20.9 ...
## $ Y1984 : num 21.4 25.3 22.6 25.8 20.9 ...
## $ Y1985 : num 21.4 25.3 22.7 25.9 20.9 ...
## $ Y1986 : num 21.4 25.3 22.8 25.9 21 ...
## $ Y1987 : num 21.4 25.3 22.8 25.9 21 ...
## $ Y1988 : num 21.3 25.3 22.9 26 21 ...
## $ Y1989 : num 21.3 25.3 23 26 21.1 ...
## $ Y1990 : num 21.2 25.3 23 26.1 21.1 ...
## $ Y1991 : num 21.2 25.3 23.1 26.2 21.1 ...
## $ Y1992 : num 21.1 25.2 23.2 26.2 21.1 ...
## $ Y1993 : num 21.1 25.2 23.3 26.3 21.1 ...
## $ Y1994 : num 21 25.2 23.3 26.4 21.1 ...
## $ Y1995 : num 20.9 25.3 23.4 26.4 21.2 ...
## $ Y1996 : num 20.9 25.3 23.5 26.5 21.2 ...
## $ Y1997 : num 20.8 25.3 23.5 26.6 21.2 ...
## $ Y1998 : num 20.8 25.4 23.6 26.7 21.3 ...
## $ Y1999 : num 20.8 25.5 23.7 26.8 21.3 ...
## $ Y2000 : num 20.7 25.6 23.8 26.8 21.4 ...
## $ Y2001 : num 20.6 25.7 23.9 26.9 21.4 ...
## $ Y2002 : num 20.6 25.8 24 27 21.5 ...
## $ Y2003 : num 20.6 25.9 24.1 27.1 21.6 ...
## $ Y2004 : num 20.6 26 24.2 27.2 21.7 ...
## $ Y2005 : num 20.6 26.1 24.3 27.3 21.8 ...
## $ Y2006 : num 20.6 26.2 24.4 27.4 21.9 ...
## $ Y2007 : num 20.6 26.3 24.5 27.5 22.1 ...
## $ Y2008 : num 20.6 26.4 24.6 27.6 22.3 ...
## - attr(*, "spec")=
## .. cols(
## .. Country = col_character(),
## .. Y1980 = col_double(),
## .. Y1981 = col_double(),
## .. Y1982 = col_double(),
## .. Y1983 = col_double(),
## .. Y1984 = col_double(),
## .. Y1985 = col_double(),
## .. Y1986 = col_double(),
## .. Y1987 = col_double(),
## .. Y1988 = col_double(),
## .. Y1989 = col_double(),
## .. Y1990 = col_double(),
## .. Y1991 = col_double(),
## .. Y1992 = col_double(),
## .. Y1993 = col_double(),
## .. Y1994 = col_double(),
## .. Y1995 = col_double(),
## .. Y1996 = col_double(),
## .. Y1997 = col_double(),
## .. Y1998 = col_double(),
## .. Y1999 = col_double(),
## .. Y2000 = col_double(),
## .. Y2001 = col_double(),
## .. Y2002 = col_double(),
## .. Y2003 = col_double(),
## .. Y2004 = col_double(),
## .. Y2005 = col_double(),
## .. Y2006 = col_double(),
## .. Y2007 = col_double(),
## .. Y2008 = col_double()
## .. )
# Load dplyr
library(dplyr)
# Check the structure of bmi, the dplyr way
glimpse(bmi)## Observations: 199
## Variables: 30
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angol...
## $ Y1980 <dbl> 21.48678, 25.22533, 22.25703, 25.66652, 20.94876, 23.3...
## $ Y1981 <dbl> 21.46552, 25.23981, 22.34745, 25.70868, 20.94371, 23.3...
## $ Y1982 <dbl> 21.45145, 25.25636, 22.43647, 25.74681, 20.93754, 23.4...
## $ Y1983 <dbl> 21.43822, 25.27176, 22.52105, 25.78250, 20.93187, 23.5...
## $ Y1984 <dbl> 21.42734, 25.27901, 22.60633, 25.81874, 20.93569, 23.6...
## $ Y1985 <dbl> 21.41222, 25.28669, 22.69501, 25.85236, 20.94857, 23.7...
## $ Y1986 <dbl> 21.40132, 25.29451, 22.76979, 25.89089, 20.96030, 23.8...
## $ Y1987 <dbl> 21.37679, 25.30217, 22.84096, 25.93414, 20.98025, 23.9...
## $ Y1988 <dbl> 21.34018, 25.30450, 22.90644, 25.98477, 21.01375, 24.0...
## $ Y1989 <dbl> 21.29845, 25.31944, 22.97931, 26.04450, 21.05269, 24.1...
## $ Y1990 <dbl> 21.24818, 25.32357, 23.04600, 26.10936, 21.09007, 24.2...
## $ Y1991 <dbl> 21.20269, 25.28452, 23.11333, 26.17912, 21.12136, 24.3...
## $ Y1992 <dbl> 21.14238, 25.23077, 23.18776, 26.24017, 21.14987, 24.4...
## $ Y1993 <dbl> 21.06376, 25.21192, 23.25764, 26.30356, 21.13938, 24.5...
## $ Y1994 <dbl> 20.97987, 25.22115, 23.32273, 26.36793, 21.14186, 24.6...
## $ Y1995 <dbl> 20.91132, 25.25874, 23.39526, 26.43569, 21.16022, 24.6...
## $ Y1996 <dbl> 20.85155, 25.31097, 23.46811, 26.50769, 21.19076, 24.7...
## $ Y1997 <dbl> 20.81307, 25.33988, 23.54160, 26.58255, 21.22621, 24.7...
## $ Y1998 <dbl> 20.78591, 25.39116, 23.61592, 26.66337, 21.27082, 24.8...
## $ Y1999 <dbl> 20.75469, 25.46555, 23.69486, 26.75078, 21.31954, 24.9...
## $ Y2000 <dbl> 20.69521, 25.55835, 23.77659, 26.83179, 21.37480, 24.9...
## $ Y2001 <dbl> 20.62643, 25.66701, 23.86256, 26.92373, 21.43664, 25.0...
## $ Y2002 <dbl> 20.59848, 25.77167, 23.95294, 27.02525, 21.51765, 25.1...
## $ Y2003 <dbl> 20.58706, 25.87274, 24.05243, 27.12481, 21.59924, 25.2...
## $ Y2004 <dbl> 20.57759, 25.98136, 24.15957, 27.23107, 21.69218, 25.2...
## $ Y2005 <dbl> 20.58084, 26.08939, 24.27001, 27.32827, 21.80564, 25.3...
## $ Y2006 <dbl> 20.58749, 26.20867, 24.38270, 27.43588, 21.93881, 25.5...
## $ Y2007 <dbl> 20.60246, 26.32753, 24.48846, 27.53363, 22.08962, 25.6...
## $ Y2008 <dbl> 20.62058, 26.44657, 24.59620, 27.63048, 22.25083, 25.7...
# View a summary of bmi
sum_bmi <- as.data.frame(do.call(cbind, lapply(bmi, summary)))
sum_bmi[,-1] %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| Y1980 | Y1981 | Y1982 | Y1983 | Y1984 | Y1985 | Y1986 | Y1987 | Y1988 | Y1989 | Y1990 | Y1991 | Y1992 | Y1993 | Y1994 | Y1995 | Y1996 | Y1997 | Y1998 | Y1999 | Y2000 | Y2001 | Y2002 | Y2003 | Y2004 | Y2005 | Y2006 | Y2007 | Y2008 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. | 19.01394 | 19.03902 | 19.06804 | 19.09675 | 19.13046 | 19.16397 | 19.1974 | 19.23481 | 19.2709 | 19.31105 | 19.3515 | 19.39625 | 19.45212 | 19.51493 | 19.58757 | 19.66996 | 19.71305 | 19.7424 | 19.76632 | 19.79587 | 19.82768 | 19.86357 | 19.83696 | 19.80717 | 19.78927 | 19.78781 | 19.8018 | 19.8291 | 19.86692 |
| 1st Qu. | 21.27082 | 21.30784 | 21.362375 | 21.41988 | 21.44577 | 21.466555 | 21.492555 | 21.501815 | 21.51734 | 21.548785 | 21.57262 | 21.599555 | 21.6513 | 21.743 | 21.758025 | 21.82634 | 21.89233 | 21.93932 | 21.99765 | 22.04246 | 22.11507 | 22.21658 | 22.291485 | 22.368735 | 22.44919 | 22.53678 | 22.6267 | 22.72645 | 22.832135 |
| Median | 23.31424 | 23.39054 | 23.46016 | 23.56861 | 23.63584 | 23.73109 | 23.82157 | 23.87321 | 23.92801 | 24.0271 | 24.14473 | 24.1979 | 24.19453 | 24.27485 | 24.35741 | 24.40843 | 24.41998 | 24.49745 | 24.4885 | 24.60608 | 24.6554 | 24.73358 | 24.80671 | 24.88965 | 25.00262 | 25.11463 | 25.23747 | 25.36476 | 25.49887 |
| Mean | 23.1547664321608 | 23.2102073869347 | 23.2648034673367 | 23.3172157286432 | 23.3705114572864 | 23.423494321608 | 23.4772568844221 | 23.5319515577889 | 23.5889472361809 | 23.6470072864322 | 23.7052090452261 | 23.7631105527638 | 23.8210563819095 | 23.8792342713568 | 23.9383301005025 | 24.001183718593 | 24.0682096984925 | 24.1390354773869 | 24.2115721105528 | 24.2864309045226 | 24.3642196984925 | 24.4430578894472 | 24.5241371859296 | 24.608454120603 | 24.6985252763819 | 24.7921330150754 | 24.8903988944724 | 24.9926272361809 | 25.0969846733668 |
| 3rd Qu. | 24.82028 | 24.88683 | 24.94037 | 25.01943 | 25.056045 | 25.10566 | 25.20339 | 25.27423 | 25.335225 | 25.370375 | 25.38608 | 25.41697 | 25.47902 | 25.539305 | 25.6153 | 25.697265 | 25.778105 | 25.85415 | 25.936495 | 26.013855 | 26.086555 | 26.185345 | 26.295505 | 26.38302 | 26.473265 | 26.527285 | 26.590595 | 26.66216 | 26.82232 |
| Max. | 28.12449 | 28.35509 | 28.58248 | 28.81528 | 29.04548 | 29.28105 | 29.52002 | 29.75319 | 29.97877 | 30.20436 | 30.42198 | 30.63617 | 30.84639 | 31.04119 | 31.23113 | 31.41477 | 31.59122 | 31.76624 | 31.94707 | 32.13097 | 32.31834 | 32.50635 | 32.70215 | 32.89697 | 33.09517 | 33.2964 | 33.49282 | 33.69373 | 33.89634 |
Visualizing your data
There are many ways to visualize data. Since this is not a course about data visualization, we will only touch on two types of plots that may be useful for quickly identifying extreme or suspicious values in your data: histograms and scatter plots.
A histogram, created with the hist() function, takes a vector (i.e. column) of data, breaks it up into intervals, then plots as a vertical bar the number of instances within each interval. A scatter plot, created with the plot() function, takes two vectors (i.e. columns) of data and plots them as a series of (x, y) coordinates on a two-dimensional plane.
# Histogram of BMIs from 2008
hist(bmi$Y2008)# Scatter plot comparing BMIs from 1980 to those from 2008
plot(bmi$Y1980, bmi$Y2008)